Filename: 302-padding-machines-for-onion-clients.txt
Title: Hiding onion service clients using padding
Author: George Kadianakis, Mike Perry
Created: Thursday 16 May 2019
Status: Closed
Implemented-In: 0.4.1.1-alpha

NOTE: Please look at section 3 of padding-spec.txt now, not this document.

0. Overview

   Tor clients use "circuits" to do anonymous communications. There are various
   types of circuits. Some of them are for navigating the normal Internet,
   others are for fetching Tor directory information, others are for connecting
   to onion services, while others are simply for measurements and testing.

   It's currently possible for MITM-type adversaries (like tor-network-level
   and local-area-network adversaries) to distinguish Tor circuit types from
   each other using a wide array of metadata and distinguishers.

   In this proposal, we study various techniques that can be used to
   distinguish client-side onion service circuits and provide WTF-PAD circuit
   padding machines (using prop#254) to hide them against certain adversaries.

1. Motivation

   We are writing this proposal for various reasons:

   1) We believe that in an ideal setting MITM adversaries should not be able
      to distinguish circuit types by inspecting traffic. Tor traffic should
      look amorphous to an outside observer to maximize uncertainty and
      anonymity properties.

      Client-side onion service circuits are an easy target for this proposal,
      because we believe we can improve their privacy with low bandwidth
      overhead.

   2) We want to start experimenting with the WTF-PAD subsystem of Tor, and
      this use-case provides us with a good testbed.

   3) We hope that by actually starting to use the WTF-PAD subsystem of Tor, we
      will encourage more researchers to start experimenting with it.

2. Scope of the proposal [SCOPE]

   Given the above, this proposal sets out to use the WTF-PAD system to hide
   client-side onion service circuits against the classifiers of the paper by
   Kwon et al. [0]

   By client-side onion service circuits we refer to these two types of circuits:
      - Client-side introduction circuits: Circuit from client to the introduction point
      - Client-side rendezvous circuits: Circuit from client to the rendezvous point

   Service-side onion service circuits are not in scope for this proposal,
   because hiding those would require more bandwidth and also more advanced
   WTF-PAD features.

   Furthermore, this proposal only aims to cloak the naive distinguishing
   features mentioned in the [KNOWN_DISTINGUISHERS] section, and can by no
   means guarantee that client-side onion service circuits are totally
   indistinguishable by other means.

   The machines specified in this proposal are meant to be lightweight and
   created for a specific purpose. This means that they can be easily extended
   with additional states to do more advanced hiding.

3. Known distinguishers against onion service circuits [KNOWN_DISTINGUISHERS]

   Over the past years it's been assumed that motivated adversaries can
   distinguish onion-service traffic from normal Tor traffic given their
   special characteristics.

   As far as we know, there has been relatively little research-level work done
   in this direction. The main article published in this area is the USENIX
   paper "Circuit Fingerprinting Attacks: Passive Deanonymization of Tor Hidden
   Services" by Kwon et al. [0]

   The above paper deals with onion service circuits in sections 3.2 and 5.1.
   It uses the following three "naive" circuit features to distinguish circuits:
      1) Circuit construction sequence
      2) Number of incoming and outgoing cells
      3) Duration of Activity ("DoA")

   All onion service circuits have particularly loud signatures for the above
   characteristics, but WTF-PAD (prop#254) gives us tools to effectively
   silence those signatures to the point where the paper's classifiers won't
   work.

4. Hiding circuit features using WTF-PAD

   According to section [KNOWN_DISTINGUISHERS] there are three circuit features
   we are attempting to hide. Here is how we plan to do this using the WTF-PAD
   system:

   1) Circuit construction sequence

      The USENIX paper uses the directions of the first 10 cells sent in a
      circuit to fingerprint them. Client-side onion service circuits have
      unique circuit construction sequences and hence they can be fingerprinted
      using just the first 10 cells.

      We use WTF-PAD to destroy this feature of onion service circuits by
      carefully sending padding cells (relay DROP cells) during circuit
      construction and making them look exactly like most general Tor circuits
      up until the end of the circuit construction sequence.

   2) Number of incoming and outgoing cells

      The USENIX paper uses the number of incoming and outgoing cells to
      distinguish circuit types. For example, client-side introduction circuits
      have the same number of incoming and outgoing cells, whereas client-side
      rendezvous circuits have more incoming than outgoing cells.

      We use WTF-PAD to destroy this feature by changing the number of cells
      sent in introduction circuits. We leave rendezvous circuits as is, since
      the actual rendezvous traffic flow usually closely resembles that of
      normal Tor circuits.

   3) Duration of Activity ("DoA")

      The USENIX paper uses the period of time during which circuits send and
      receive cells to distinguish circuit types. For example, client-side
      introduction circuits are really short lived, whereas service-side
      introduction circuits are very long lived. On the other hand, rendezvous
      circuits have the same median lifetime as general Tor circuits, which is
      10 minutes.

      We use WTF-PAD to destroy this feature of client-side introduction
      circuits by setting a special WTF-PAD option, which keeps the circuits
      open for 10 minutes, mimicking the DoA of general Tor circuits.

4.1. A dive into general circuit construction sequences [CIRCCONSTRUCTION]

   In this section we give an overview of what circuit construction looks like
   to a network or guard-level adversary. We use this knowledge to build
   padding machines that can make intro and rend circuits look like these
   general circuits.

   In particular, most general Tor circuits used to surf the web or download
   directory information start with the following 6-cell relay cell sequence
   (cells surrounded in [brackets] are outgoing, the others are incoming):

     [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [BEGIN] -> CONNECTED

   When this is done, the client has established a 3-hop circuit and also
   opened a stream to the other end. Usually after this comes a series of DATA
   cells that either fetch pages, establish an SSL connection or fetch
   directory information:

     [DATA] -> [DATA] -> DATA -> DATA

   The above stream of 10 relay cells defines the vast majority of general
   circuits that came out of Tor Browser during our testing, and it's what we
   are going to use to make introduction and rendezvous circuits blend in.

   Please note that in this section we only investigate relay cells and not
   connection-level cells like CREATE/CREATED or AUTHENTICATE/etc. that are
   used during the link-layer handshake. The rationale is that connection-level
   cells depend on the type of guard used and are not an effective fingerprint
   for a network/guard-level adversary.

5. WTF-PAD machines

   For the purposes of this proposal we will make use of four WTF-PAD machines
   as follows:

      - Client-side introduction circuit hiding machine (origin-side)
      - Client-side introduction circuit hiding machine (relay-side)

      - Client-side rendezvous circuit hiding machine (origin-side)
      - Client-side rendezvous circuit hiding machine (relay-side)

   In the following sections we will analyze these machines.

5.1. Client-side introduction circuit hiding machines [INTRO_CIRC_HIDING]

   These two machines are meant to hide client-side introduction circuits. The
   origin-side machine sits on the client and sends padding towards the
   introduction circuit, whereas the relay-side machine sits on the middle-hop
   (second hop of the circuit) and sends padding towards the client. The
   padding from the origin-side machine terminates at the middle-hop and does
   not get forwarded to the actual introduction point.

   Both of these machines only get activated for introduction circuits, and
   only after an INTRODUCE1 cell has been sent out.

   This means that before the machine gets activated our cell flow looks like this:

    [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [INTRODUCE1]

   Comparing the above with section [CIRCCONSTRUCTION], we see that the above
   cell sequence matches the one from general circuits up to the first 7 cells.

   However, in normal introduction circuits this is followed by an
   INTRODUCE_ACK and then the circuit gets torn down, which does not match
   the sequence from [CIRCCONSTRUCTION].

   Hence when our machine is used, after sending an [INTRODUCE1] cell, we also
   send a [PADDING_NEGOTIATE] cell, which gets answered by a PADDING_NEGOTIATED
   cell and an INTRODUCE_ACK cell. This makes us match the [CIRCCONSTRUCTION]
   sequence up to the first 10 cells.
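
   The comparison above can be checked mechanically. The following is a
   minimal sketch, not Tor code: the parse() and matching_prefix() helpers are
   hypothetical names introduced for illustration, and only cell directions
   are compared, since that is the feature the classifier is described as
   using.

```python
def parse(seq):
    """Parse '[X] -> Y' notation into (name, outgoing) pairs.

    Cells in [brackets] are outgoing, the others are incoming.
    """
    cells = []
    for token in seq.split("->"):
        token = token.strip()
        outgoing = token.startswith("[") and token.endswith("]")
        cells.append((token.strip("[]"), outgoing))
    return cells


def matching_prefix(a, b):
    """Count the leading cells whose direction agrees in both sequences."""
    count = 0
    for (_, dir_a), (_, dir_b) in zip(a, b):
        if dir_a != dir_b:
            break
        count += 1
    return count


BUILD = "[EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2"

GENERAL = parse(BUILD + " -> [BEGIN] -> CONNECTED"
                " -> [DATA] -> [DATA] -> DATA -> DATA")

# Introduction circuit without padding: INTRODUCE_ACK follows INTRODUCE1.
INTRO_UNPADDED = parse(BUILD + " -> [EXTEND2] -> EXTENDED2"
                       " -> [INTRODUCE1] -> INTRODUCE_ACK")

# Introduction circuit with the padding negotiation described above.
INTRO_PADDED = parse(BUILD + " -> [EXTEND2] -> EXTENDED2 -> [INTRODUCE1]"
                     " -> [PADDING_NEGOTIATE] -> PADDING_NEGOTIATED"
                     " -> INTRODUCE_ACK")

unpadded_match = matching_prefix(GENERAL, INTRO_UNPADDED)
padded_match = matching_prefix(GENERAL, INTRO_PADDED)
print(unpadded_match, padded_match)
```

   Under these assumptions the unpadded circuit agrees with the general
   sequence for 7 cells and the padded one for all 10.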

   After that, we continue sending padding from the relay-side machine so as to
   fake a directory download, or an SSL connection setup. We also want to
   continue sending padding so that the connection stays up longer to destroy
   the "Duration of Activity" fingerprint.

   To calculate the padding overhead, we see that the origin-side machine just
   sends a single [PADDING_NEGOTIATE] cell, whereas the relay-side machine
   sends a PADDING_NEGOTIATED cell and between 7 and 10 DROP cells. This means
   that the average overhead of these machines is about 11 padding cells.

   In terms of WTF-PAD terminology, these machines have three states (START,
   OBF, END). They move from the START to OBF state when the first
   non-padding cell is received on the circuit, and they stay in the OBF
   state until all the padding gets depleted. The OBF state is controlled by
   a histogram which specifies the parameters described in the paragraphs
   above. After all the padding finishes, it moves to END state.
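
   The state transitions described above can be sketched as a toy model. This
   is an illustration only and does not use Tor's circuit padding API: the
   class and method names are hypothetical, the budget of 7 to 10 DROP cells
   is taken from the overhead estimate, and the histogram that controls the
   timing of padding in the OBF state is not modelled.

```python
import random
from enum import Enum


class State(Enum):
    START = "START"
    OBF = "OBF"
    END = "END"


class IntroHidingMachine:
    """Toy relay-side machine with the three states named in the text."""

    def __init__(self, rng=None):
        rng = rng or random.Random()
        self.state = State.START
        # Assumed padding budget: between 7 and 10 DROP cells.
        self.padding_left = rng.randint(7, 10)
        self.sent = 0

    def on_nonpadding_received(self):
        """The first non-padding cell moves the machine from START to OBF."""
        if self.state is State.START:
            self.state = State.OBF

    def on_padding_timer(self):
        """Send one DROP cell if in OBF; return True if a cell was sent."""
        if self.state is not State.OBF:
            return False
        self.padding_left -= 1
        self.sent += 1
        if self.padding_left == 0:
            # Padding is depleted, so the machine moves to END.
            self.state = State.END
        return True


machine = IntroHidingMachine(random.Random(0))
assert not machine.on_padding_timer()   # no padding while in START
machine.on_nonpadding_received()        # START -> OBF
while machine.on_padding_timer():       # stay in OBF until depleted
    pass
print(machine.state, machine.sent)
```

   In a real machine the timer events would be driven by delays sampled from
   the OBF state's histogram rather than by a tight loop.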

   We also set a special WTF-PAD flag which keeps the circuit open even after
   the introduction is performed. In particular, with this feature the circuit
   will stay alive for the same duration as normal web circuits before they
   expire (usually 10 minutes).

5.2. Client-side rendezvous circuit hiding machines

   The rendezvous circuit machines apply on client-side rendezvous circuits and
   only after the rendezvous point has been established (REND_ESTABLISHED has
   been received). Up to that point, the following cell sequence has been
   observed on the circuit:

    [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [ESTABLISH_REND] -> REND_ESTABLISHED

   which matches the general circuit construction sequence [CIRCCONSTRUCTION]
   up to the first 6 cells. However after that, normal rendezvous circuits
   receive a RENDEZVOUS2 cell followed by a [BEGIN] and a CONNECTED, which does
   not fit the circuit construction sequence we are trying to imitate.

   Hence our machine gets activated right after REND_ESTABLISHED is received,
   and continues by sending a [PADDING_NEGOTIATE] and a [DROP] cell, before
   receiving a PADDING_NEGOTIATED and a DROP cell, effectively blending into
   the general circuit construction sequence on the first 10 cells.
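
   To see where an unpadded rendezvous circuit departs from the general
   sequence, and that the padded one does not, the directions can be compared
   position by position. This is a small sketch under the same notation as
   above; directions() and first_divergence() are hypothetical helper names,
   not part of Tor.

```python
def directions(seq):
    """Map '[X] -> Y' notation to a string: '+' outgoing, '-' incoming."""
    return "".join(
        "+" if token.strip().startswith("[") else "-"
        for token in seq.split("->")
    )


def first_divergence(a, b):
    """Index of the first differing direction, or None if there is none."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


GENERAL = directions(
    "[EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [BEGIN] -> CONNECTED"
    " -> [DATA] -> [DATA] -> DATA -> DATA"
)

SETUP = ("[EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2"
         " -> [ESTABLISH_REND] -> REND_ESTABLISHED")

# Without padding: RENDEZVOUS2 arrives, then a stream is opened.
UNPADDED = directions(SETUP + " -> RENDEZVOUS2 -> [BEGIN] -> CONNECTED")

# With the machine: two outgoing padding cells, then two incoming ones.
PADDED = directions(
    SETUP + " -> [PADDING_NEGOTIATE] -> [DROP] -> PADDING_NEGOTIATED -> DROP"
)

unpadded_divergence = first_divergence(GENERAL, UNPADDED)
padded_divergence = first_divergence(GENERAL, PADDED)
print(unpadded_divergence, padded_divergence)
```

   With these inputs the unpadded circuit first differs at the seventh cell
   (index 6), while the padded circuit shows no difference over the first 10
   cells.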

   After that our machine gets deactivated, and we let the actual rendezvous
   circuit shape the traffic flow. Since rendezvous circuits usually imitate
   general circuits (their purpose is to surf the web), we can expect that they
   will look alike.

   In terms of overhead, this machine is quite light. Each side sends 2 padding
   cells, for a total of 4 padding cells.

6. Overhead analysis

   Given the parameters above, intro circuit machines have an overhead of 11
   padding cells, and rendezvous circuit machines have an overhead of 4
   padding cells. This means that for every intro and rendezvous circuit
   there will be an overhead of 15 padding cells on average, which is about
   7.5KB.

   In the PrivCount paper [1] we learn that the Tor network sees about 12
   million successful descriptor fetches per day. We can use this figure to
   assume that the Tor network also sees about 12 million intro and rendezvous
   circuits per day. Given the 7.5KB overhead of each of these circuits, we get
   that our padding machines impose an additional overhead of roughly 94GB per
   day on the network, which is about 3.9GB per hour.

   XXX Is this too much overhead? Using the graphs from metrics we see that
       the Tor network has a total capacity of 300 Gbit/s, which is about
       135000GB per hour, so 3.9GB per hour is only about 0.003% of capacity,
       but still...
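
   The arithmetic in this section can be reproduced with a few lines. This is
   a back-of-the-envelope sketch: the 512-byte cell size is an assumption
   (it is what makes 15 cells come to 7.5KB), and all variable names are
   illustrative.

```python
CELL_BYTES = 512                 # assumed cell size in bytes
INTRO_PADDING_CELLS = 11         # intro circuit machines, from section 5.1
REND_PADDING_CELLS = 4           # rendezvous circuit machines, section 5.2
CIRCUIT_PAIRS_PER_DAY = 12_000_000   # descriptor fetches per day, from [1]
NETWORK_CAPACITY_GBIT_S = 300    # total capacity quoted from metrics

cells_per_pair = INTRO_PADDING_CELLS + REND_PADDING_CELLS
bytes_per_pair = cells_per_pair * CELL_BYTES

# Daily and hourly overhead in decimal gigabytes.
gb_per_day = bytes_per_pair * CIRCUIT_PAIRS_PER_DAY / 1e9
gb_per_hour = gb_per_day / 24

# Network capacity converted from Gbit/s to GB per hour.
capacity_gb_per_hour = NETWORK_CAPACITY_GBIT_S / 8 * 3600

share = gb_per_hour / capacity_gb_per_hour

print(f"{bytes_per_pair} bytes per intro+rend pair")
print(f"{gb_per_day:.2f} GB/day, {gb_per_hour:.2f} GB/hour")
print(f"{share:.6%} of network capacity")
```

   With these inputs the estimate comes to about 92GB per day, or roughly
   3.8GB per hour, which is slightly below but in the same range as the
   figures quoted above.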

7. Discussion

7.1. Alternative approaches

   These machines try to hide onion service client-side circuits by obfuscating
   their looks. This is a reasonable approach, but if the resulting circuits
   look unlike any other Tor circuits, they would still be fingerprintable just
   by that fact.

   Another approach we could take is to make normal client circuits look like
   onion service circuits, or just make normal clients establish fake onion
   service circuits periodically. The hope here is that the adversary won't be
   able to distinguish fake onion service circuits from real ones. This
   approach has not been taken yet, mainly because it requires additional
   WTF-PAD features and poses greater overhead risks.

7.2. Future work

   As discussed in [SCOPE], this proposal only aims to hide some very specific
   features of client-side onion service circuits. There is lots of work to be
   done here to see what other features can be used to distinguish such
   circuits, and also what other classifiers can be built using deep learning
   and whatnot.

---

   [0]: https://www.usenix.org/node/190967
        https://blog.torproject.org/technical-summary-usenix-fingerprinting-paper

   [1]: "Understanding Tor Usage with Privacy-Preserving Measurement"
        by Akshaya Mani, T Wilson-Brown, Rob Jansen, Aaron Johnson, and Micah Sherr
        In Proceedings of the Internet Measurement Conference 2018 (IMC 2018).
